ML Cheat Sheet
- Regression (Predicting a Number): Used when the output is a continuous value (e.g., price, temperature, stock value).
- Linear Regression: Fits a straight line to data.
- Example: Predicting the price of a house based on square footage.
- Syntax:
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol="features", labelCol="price")
model = lr.fit(train_data)
- Classification (Predicting a Category): Used when the output is a discrete label (e.g., Yes/No, Red/Blue/Green).
- Logistic Regression: Despite the name, it's for classification. Predicts the probability of a class.
- Example: Predicting if a customer will "churn" (leave) or stay.
- Syntax: from pyspark.ml.classification import LogisticRegression
- Naive Bayes: Based on Bayes' Theorem. Assumes features are independent.
- Example: Email Spam filtering or Sentiment Analysis (Positive/Negative).
- Syntax: from pyspark.ml.classification import NaiveBayes
- Random Forest: A "forest" of many decision trees. Very robust and popular.
- Example: Predicting if a loan application is "High Risk" or "Low Risk."
- Syntax: from pyspark.ml.classification import RandomForestClassifier
- Clustering (Finding Hidden Groups): Unsupervised learning; the data has no labels, and the model finds patterns on its own.
- K-Means: Groups data points into 'K' number of clusters based on similarity.
- Example: Grouping users by their "shopping persona" (e.g., Bargain Hunters vs. Luxury Buyers).
- Syntax:
from pyspark.ml.clustering import KMeans
kmeans = KMeans(k=5, seed=1)
model = kmeans.fit(dataset)
- Time Series (Predicting the Future): Used for data ordered by time (daily sales, hourly sensor readings).
- ARIMA / SARIMA:
- ARIMA: (AutoRegressive Integrated Moving Average) looks at past values and past errors.
- SARIMA: Includes Seasonality (e.g., sales always spike in December).
- Example: Forecasting electricity demand for the next 48 hours.
- Note: PySpark MLlib doesn't have a native ARIMA. You typically use statsmodels inside a Spark pandas_udf to run it in parallel.
- Syntax (statsmodels):
from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(data, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
results = model.fit()
- Recommendation Engines
- ALS (Alternating Least Squares): A type of Collaborative Filtering.
- Example: Netflix "Because you watched..." or Amazon "Users who bought this also bought..."
- Syntax:
from pyspark.ml.recommendation import ALS
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating")
model = als.fit(train_data)
Quick Comparison Table

| Technique | Goal | Type | Library (PySpark) |
|---|---|---|---|
| Linear Regression | Predict a number | Supervised | pyspark.ml.regression |
| Logistic Regression | Predict a category (0 or 1) | Supervised | pyspark.ml.classification |
| Naive Bayes | Classify text/labels | Supervised | pyspark.ml.classification |
| K-Means | Find hidden groups | Unsupervised | pyspark.ml.clustering |
| ARIMA/SARIMA | Forecast future time steps | Stats/Time Series | statsmodels (via Pandas UDF) |
| Random Forest | High-accuracy classification | Supervised | pyspark.ml.classification |

The PySpark Workflow Pattern
In PySpark, almost every ML task follows the same 3-step pattern (sketched after this list):
- VectorAssembler: Combine your feature columns into a single "features" vector column.
- Fit: Train the model: model = algorithm.fit(df).
- Transform: Make predictions: predictions = model.transform(new_df).
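A minimal sketch of the pattern, assuming a Spark DataFrame df with a numeric sqft feature column and a price label (hypothetical names):
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
# 1. Assemble raw feature columns into a single vector column
assembler = VectorAssembler(inputCols=["sqft"], outputCol="features")
df_vec = assembler.transform(df)
# 2. Fit: train the model on the assembled data
model = LinearRegression(featuresCol="features", labelCol="price").fit(df_vec)
# 3. Transform: append a "prediction" column
predictions = model.transform(df_vec)
The same assemble/fit/transform shape applies to the classifiers and clustering estimators above.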
ML Frameworks: scikit-learn, TensorFlow, PyTorch, and More
scikit-learn (sklearn)
- Best for: Tabular data, classical ML (regression, classification, clustering, preprocessing, pipelines).
- Strengths: Simple API, fast prototyping, tons of built-in models and metrics, great for interviews and competitions.
- Limitations: Not for deep learning, limited GPU support, not for large-scale distributed training.
- Typical workflow:
- Preprocess (LabelEncoder, OneHotEncoder, StandardScaler, etc.)
- Split data (train_test_split)
- Fit model (model.fit)
- Predict/evaluate (model.predict, accuracy_score, confusion_matrix)
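A minimal sketch of that workflow, assuming X is a numeric feature matrix and y a vector of class labels:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)           # fit preprocessing on train only
model = LogisticRegression().fit(scaler.transform(X_train), y_train)
preds = model.predict(scaler.transform(X_test))  # reuse the fitted scaler on test
print(accuracy_score(y_test, preds))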
TensorFlow (and Keras)
- Best for: Deep learning (neural networks, CNNs, RNNs, transformers), large-scale data, production deployment.
- Strengths: GPU/TPU support, scalable, flexible, Keras API is user-friendly, used in industry for image, text, audio, tabular, and time series.
- Limitations: More complex than sklearn for simple tasks, steeper learning curve for custom models.
- Typical workflow:
- Define model (Sequential or Functional API)
- Compile (optimizer, loss, metrics)
- Fit (model.fit)
- Evaluate/predict (model.evaluate, model.predict)
PyTorch
- Best for: Deep learning research, custom neural networks, NLP, computer vision, academic work.
- Strengths: Dynamic computation graph (easier debugging), Pythonic, strong community, used in research and production.
- Limitations: Slightly more code for basic tasks than Keras, but more flexible for advanced models.
- Typical workflow:
- Define model (nn.Module)
- Define loss/optimizer
- Training loop (forward, backward, step)
- Evaluate/predict
Other ML Libraries
- XGBoost/LightGBM/CatBoost:
- Specialized for gradient boosting (tabular data, competitions, high accuracy)
- Often outperform sklearn’s GradientBoostingClassifier/Regressor
- Used via their own API or sklearn wrappers (sketch after this list)
- statsmodels:
- For statistical models (linear regression, ARIMA, time series, GLM)
- More statistical tests, summary tables, p-values (see the OLS sketch below)
- spaCy/NLTK:
- For NLP tasks (tokenization, parsing, entity recognition; see the spaCy sketch below)
- Prophet:
- For time series forecasting (easy API, handles seasonality/holidays; see the Prophet sketch below)
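A minimal XGBoost sketch via its sklearn-compatible wrapper, assuming xgboost is installed and X_train/X_test, y_train/y_test come from a split like the one shown earlier:
from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=200, learning_rate=0.1)  # common knobs to tune
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on held-out data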
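A minimal statsmodels sketch showing the summary output the bullet refers to, assuming numeric X and y:
import statsmodels.api as sm
X_const = sm.add_constant(X)        # add an intercept term
results = sm.OLS(y, X_const).fit()  # ordinary least squares
print(results.summary())            # coefficients, p-values, confidence intervals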
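A minimal spaCy sketch, assuming the small English model has been downloaded (python -m spacy download en_core_web_sm):
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in Berlin.")
print([token.text for token in doc])                 # tokenization
print([(ent.text, ent.label_) for ent in doc.ents])  # named entities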
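A minimal Prophet sketch, assuming a DataFrame df with Prophet's required ds (dates) and y (values) columns:
from prophet import Prophet
m = Prophet()                                 # models trend, seasonality, holidays
m.fit(df)
future = m.make_future_dataframe(periods=30)  # extend 30 periods past the data
forecast = m.predict(future)                  # yhat plus uncertainty bounds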
When to Use What?
| Framework | Best For | Not For |
|---|---|---|
| scikit-learn | Tabular/classical ML, fast prototyping | Deep learning, images |
| TensorFlow | Deep learning, production, scale | Small tabular problems |
| PyTorch | Deep learning research, NLP, CV | Simple tabular ML |
| XGBoost/LGBM | Tabular, competitions, accuracy | Deep learning, images |
| statsmodels | Statistical analysis, time series | Deep learning |
| spaCy/NLTK | NLP preprocessing, pipelines | Tabular, vision |
| Prophet | Time series forecasting | Classification |
General Advice
- Use scikit-learn for most tabular ML tasks, interviews, and quick experiments.
- Use TensorFlow/Keras or PyTorch for deep learning (images, text, audio, complex models).
- Use XGBoost/LightGBM for tabular data when you want the best accuracy (after trying sklearn models).
- Use statsmodels for statistical analysis, time series, and when you need interpretability (p-values, confidence intervals).
- For NLP, use spaCy for pipelines, NLTK for classic NLP, and transformers (HuggingFace) for state-of-the-art models.
Example: Keras Neural Network (TensorFlow)
from tensorflow import keras
from tensorflow.keras import layers
model = keras.Sequential([
layers.Dense(64, activation='relu', input_shape=(X.shape[1],)),
layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=32)
model.evaluate(X_test, y_test)
Example: PyTorch Neural Network
import torch
import torch.nn as nn
import torch.optim as optim
class Net(nn.Module):
def __init__(self, input_dim):
super().__init__()
self.fc1 = nn.Linear(input_dim, 64)
self.relu = nn.ReLU()
self.fc2 = nn.Linear(64, 1)
self.sigmoid = nn.Sigmoid()
def forward(self, x):
x = self.relu(self.fc1(x))
x = self.sigmoid(self.fc2(x))
return x
model = Net(X.shape[1])
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters())
# Training loop omitted for brevity
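A minimal sketch of the loop that comment refers to, assuming X_train and y_train are float tensors with y_train shaped (n, 1) to match the sigmoid output:
for epoch in range(10):
    optimizer.zero_grad()               # reset accumulated gradients
    outputs = model(X_train)            # forward pass
    loss = criterion(outputs, y_train)  # binary cross-entropy
    loss.backward()                     # backpropagate
    optimizer.step()                    # update weights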
Practical ML Model Cheat Sheet (Interview/Assessment Prep)
General ML Workflow (Tabular Data)
- Load Data: Read your data into a DataFrame.
- Define Features/Target: Select feature columns (X) and target column (y).
- Preprocess: Encode non-numeric data if needed (LabelEncoder, OneHotEncoder).
- Split: Use train_test_split for train/test sets.
- Fit: Train your model (fit on X_train, y_train).
- Score/Evaluate: Use score(), accuracy_score, or other metrics on X_test, y_test.
Random Forest (RF) vs Gradient Boosting (GB)
Random Forest:
- Ensemble of many decision trees, built in parallel (bagging).
- Robust, less prone to overfitting, fast to train, fewer hyperparameters.
- Good baseline for tabular data.
Gradient Boosting:
- Ensemble of trees built sequentially, each correcting the previous (boosting).
- More tunable parameters (learning rate, n_estimators, etc.), can overfit if not tuned.
- Often achieves higher accuracy if tuned well, but slower to train.
When to use:
- Try RF for a quick, robust baseline.
- Use GB if you want to push for best accuracy and can tune parameters.
Typical usage (scikit-learn):
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf.score(X_test, y_test)
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
gb.score(X_test, y_test)
Naive Bayes
What is it?
- Probabilistic classifier based on Bayes’ theorem, assumes features are independent given the class.
- Very fast, works well for text classification (spam, sentiment, etc.).
Types:
- GaussianNB: for continuous features.
- MultinomialNB: for counts (text, word counts).
- BernoulliNB: for binary features.
When to use:
- Text classification, high-dimensional data, simple/fast baseline.
Typical usage (text):
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = MultinomialNB()
model.fit(X, y)
model.score(X, y)  # accuracy on the training data; evaluate on a held-out split in practice
Metrics & Confusion Matrix
Confusion Matrix:
- For binary classification: [[TN, FP], [FN, TP]]
- Recall: TP / (TP + FN)
- False Positive Rate: FP / (FP + TN)
- Accuracy: (TP + TN) / (TP + TN + FP + FN)
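A quick sketch of these metrics with scikit-learn, assuming true labels y_test and predictions preds:
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()  # binary case: [[TN, FP], [FN, TP]]
print(recall_score(y_test, preds))    # TP / (TP + FN)
print(fp / (fp + tn))                 # false positive rate
print(accuracy_score(y_test, preds))  # (TP + TN) / total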
PCA vs LDA
PCA (Principal Component Analysis):
- Unsupervised, reduces dimensionality by maximizing variance.
- Use when you want to reduce features, handle multicollinearity, or don’t have labels.
LDA (Linear Discriminant Analysis):
- Supervised, reduces dimensionality by maximizing class separability.
- Use when you want to separate classes and have labels.
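A minimal sketch of both reducers, assuming numeric X and class labels y:
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
X_pca = PCA(n_components=2).fit_transform(X)      # unsupervised: ignores labels
lda = LinearDiscriminantAnalysis(n_components=1)  # at most n_classes - 1 components
X_lda = lda.fit_transform(X, y)                   # supervised: uses labels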
General Tips
- Most ML models require numeric features; encode categorical/text columns first.
- If the target is free-form text rather than a fixed set of categories, use NLP/deep learning models, not RF/GB/NB.
- Always preprocess test data with the exact steps/vectorizer fitted on the training data.
Example: Naive Bayes Text Classification
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
texts = ["I love ML", "ML is great", "I hate spam", "spam is bad"]
labels = [1, 1, 0, 0]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = MultinomialNB()
model.fit(X, labels)
model.predict(vectorizer.transform(["payment plan"]))  # reuse the fitted vectorizer; words unseen in training become all zeros
Model Selection Table
| Model | Use Case | Data Type | Pros | Cons |
|---|---|---|---|---|
| Linear Regression | Predict a number | Numeric | Simple, interpretable | Only linear relationships |
| Logistic Regression | Predict a category (0/1) | Numeric/categorical | Probabilities, fast | Only linear boundaries |
| Naive Bayes | Text/category classification | Text/categorical | Fast, works for text | Strong independence assumption |
| Random Forest | Classification/regression | Tabular | Robust, less overfitting | Slower, less interpretable |
| Gradient Boosting | Classification/regression | Tabular | High accuracy, flexible | Slow, needs tuning |
| K-Means | Clustering | Numeric | Unsupervised, simple | Needs k, only spherical clusters |
| ARIMA/SARIMA | Time series forecasting | Time series | Handles trends/seasonality | Needs stationary data |